Making Tensor Factorizations Robust to Non-Gaussian 

Noise* 



Eric C. Chi* 

Department of Statistics 
Rice University 
Houston, TX 15213 

echigrice . edu 



Tamara G. Kolda 1 

Sandia National Laboratories 
Livermore, CA 94551-9159 

tgkoldagsandia . gov 



Abstract 



Tensors are multi-way arrays, and the CANDECOMP/PARAFAC (CP) tensor factor- 
ization has found application in many different domains. The CP model is typically 
fit using a least squares objective function, which is a maximum likelihood estimate 
under the assumption of i.i.d. Gaussian noise. We demonstrate that this loss function 
can actually be highly sensitive to non-Gaussian noise. Therefore, we propose a loss 
function based on the 1-norm because it can accommodate both Gaussian and grossly 
non-Gaussian perturbations. We also present an alternating majorization-minimization 
algorithm for fitting a CP model using our proposed loss function. 

1 Introduction 

The CANDECOMP/PARAFAC (CP) tensor factorization can be considered a higher-order generalization 
of the matrix singular value decomposition |01|7l and has many applications. The canonical fit function 
for the CP tensor factorization is based on the Frobenius norm, meaning that it is a maximum likelihood 
estimate (MLE) under the assumption of additive i.i.d. Gaussian perturbations. It turns out, however, that 
this loss function can be very sensitive to violations in the Gaussian assumption. However, many other 
types of noise are relevant for CP models. For example, in fMRI neuroimaging studies, movement by the 
subject can lead to sparse high-intensity changes that are easily confused with brain activity [6|. Like- 
wise, in foreground/background separation problems in video surveillance, a subject walking across the 
field of view represents another instance of a sparse high intensity change 1131 . In both examples, there 
is a relatively large perturbation in magnitude that affects only a relatively small fraction of data points; 
we call this artifact noise. These scenarios are particularly challenging because the perturbed values are 
on the same scale as normal values (i.e., true brain activity signals and background pixel intensities). 
Consequently, there is a need to explore factorization methods that are robust against violations in the 
Gaussian assumption. In this paper, we consider a loss based on the 1-norm which is known to be robust 
or insensitive to gross non-Gaussian perturbations |8 1. 

Vorobyov et al. previously described two ways of solving the least 1-norm CP factorization problem 
based on a linear programming and weighted median filtering 1 14 1. Our method differs in that we use 
a majorization-minimization (MM) strategy [9|. Like lfT4ll our method performs block minimization. 
An advantage of our approach is that each block minimization can be split up into many small and 
independent optimization problems which may scale more favorably with a tensor's size. 
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Throughout, we use the following definitions and conventions. All vectors are column vectors. The 
transpose of the i th row of a matrix A is denoted by a;. The order of a tensor is the number of dimensions, 
also known as ways or modes. Fibers are the higher-order analogue of matrix rows and columns. A fiber 
is defined by fixing every index but one. A matrix column is a mode-1 fiber and a matrix row is a mode-2 
fiber. The mode-ri matricization of a tensor % 6 ^_IixI 2 x---xIn [ s denoted by X( n ) and arranges the 
mode-n fibers to be the columns of the resulting matrix. 

The rest of this paper is organized as follows. The robust iterative algorithm is derived in Section [2] In 
Section[3]we compare CPAL1 and the standard CP factorizations by alternating least squares (CPALS) 
in the presence of non-Gaussian perturbations on simulated data. Concluding remarks are given in Sec- 
tion m 



2 Majorization-minimization for tensor factorization 

MM algorithms have been applied to factorization problems previously lfT2l [3ll5l. The idea is to convert 
a hard optimization problem (e.g., non-convex, non-differentiable) into a series of simpler ones (e.g., 
smooth convex), which are easier to minimize than the original. To do so, we use majorization functions, 
i.e., h majorizes g at x e R" if h(u) > g(u) for all u £ W 1 and h(x) — g(x). 

Given a procedure for constructing a majorization, we can define the MM algorithm to find a minimizer 
of a function g as follows. Let x^ ' denotes the fc th iterate. (1) Find a majorization h(-\K^) of g at 
x^- 1 . (2) Set x( k+1 ^ = argmin x h(x\x^). (3) Repeat until convergence. This algorithm always takes 
non-increasing steps with respect to g. Moreover, sufficient conditions for the MM algorithm to converge 
to a stationary point are well known 1111 . Specifically, the MM iterates will converge to a stationary point 
of g if g is continuously differentiable, coercive in the sense that all its level sets must be compact, and 
all its stationary points are isolated; and the function h(x\y) is jointly twice continuously differentiable 
in (x, y) and is strictly convex in x with y fixed. 



2.1 Solving the l\ regression problem by an MM algorithm 

We now derive an appropriate majorization function for approximate l\ regression; this is subsequently 
used for our robust tensor factorization. The unregularized majorization can be found in ifTTIl . Given 
a vector y <E E 7 and a matrix M £ K /Xj , we search for a vector u £ R J that minimizes the loss 
L(u) = X)il r i( u )l where ^(u) — yi — mju. Note that L(u) is not smooth and may not be strictly 
convex if M is not full rank. Therefore, we instead consider the following smoothed and regularized 
version to L(u): 

i^(u)=^ V /^(u)2 +e+ |||u|| 2 , (1) 

where e and [i are small positive numbers. In this case, L C Al (u) at u £ M. J is majorized by 

h eA u\u) = E(vWTI + ^^ffl| + f Huf . (2) 
7=t I 2 v /r,(u) 2 + e J 2 

Both the loss L e ^ and its majorization h e>li meet the sufficient conditions that guarantee convergence of 
the MM algorithm to a stationary point of L e ^. Since L e>li is strictly convex and coercive, it has exactly 
one stationary point. Thus, the MM algorithm converges to the global minimum of L e „. After some 
simplification, the iterate mapping can be expressed as 

u( m+1 > = argmin/y - r ' ( " )2 + ^||u|| 2 ) . (3) 

LetW (m) £ R lxl be the diagonal matrix with (W (m) ) j;i = (r l (u( m )) 2 + e)- 1 / 2 . Then the minimization 
problem Q is a regularized weighted least squares problem with a unique solution, i.e., 

u( m+1 ) =argmin f(y - Mu) T W (m) (y - Mu) + -||u|| 2 ) = (M T W (m) M + Ml) _1 M T W (m V- 
u I 2 J 

(4) 
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2.2 Tensor factorization using l\ regression 



We now derive CPAL1 for a 3-way tensor DC of size I\ x I2 x I3 (it is straightforward to generalize the 
algorithm to tensors of arbitrary size). To perform a rank-i? factorization we minimize L e „(A, B, C) 
which is the regularized approximate 1-norm of the difference between DC and its rank-i? approximation, 
where A e R IlxR , B e K /2X - R , C g E. l3XR : 



L^(A,B,C) = 



5> ir & i2r c i3r ) + £ + f (||A||f. + ||B|| 2 F + ||C" 2 



IF J 



In a round-robin fashion, we repeatedly update one factor matrix while holding the other two fixed. 
Note that the mode-1 matricization of the rank-i? approximation is A(C B) T where denotes the 
Khatri-Rao product [ 10 1. Then the subproblem of updating A for a fixed B and C is 

fEE ]l (( x « - A < c B )"%) 2 + e + f ii a ii^ ^ 

»=i j=i 

This minimization problem is separable in the rows of A, and the optimization problem for a given row 
is an l\ regression problem. Thus, we can apply the update rule Q with y equal to the z th row of X(i) 
and M = C B. The other two subproblems are solved analogously. 

3 Simulation Experiment 

We compare the results of CPAL1 with CPALS implemented in the Tensor Toolbox [2| in the presence 
of Gaussian and artifact noise. We created 3-way tensors, DC' € Jj50x50x50 Q f xw fc_§ as follows. We 
first generated random factor matrices A, B, C <E E 50x5 where the matrix elements were the absolute 
values of i.i.d. draws from a standard Gaussian. The ijk entry of the noise free tensor DC was then 

set to be Yl r =i a iirbi 2 rCi 3r - Then to each DC we added dense Gaussian noise and artifact outliers. All 
random variables we describe were independently drawn. We generated an artifact tensor f as follows. 
A fraction rj of the tensor entries was selected randomly. We then assigned to each of the selected entries 
a value drawn from a Gamma distribution with shape parameter 50 and scale parameter 1/50. All other 
entries were set to 0. For the dense Gaussian noise tensor Q, the entries were i.i.d. draws from a 
standard Gaussian. The tensor X was obtained by adding the noise and artifact tensors to DC : 

for r] = 0.1, 0.2 and 7 = 0.5, 1.0, 1.5 and 2.0. For all combinations of 77 and 7 the scaled values of qij k 
were less than the largest value of DC. 

For every pair (77, 7) we performed 100 rank-5 factorizations under the two methods. For CPAL1 com- 
putations we set e = 10~ 10 and /j = 10~ 8 . Initial points for all tests were generated using the n-mode 
singular vectors of the tensor (i.e., the nvecs command in the Tensor Toolbox). To assess the goodness 
of a computed factorization we calculated the factor match score (FMS) between the estimated and true 
factors [1]. The FMS ranges between and 1; an FMS of 1 corresponds to a perfect recovery of the 
original factors. 



Figure la shows boxplots of the FMS under both methods. The scores for CPALS decreased as the 
contribution of non-Gaussian noise increased. In contrast regardless of the noise distributions applied 
CPAL1 tended to recover the true factorization with the exception of occasionally finding local minima, 

Figure [Tb] compares one column of one recovered factor matrix when rj = 0.2 and 7 = 2 for the two 
methods. In this instance the CPALS factorization has some trouble recovering the true factor column. In 
this example the FMS was 0.91 and 0.64 for CPAL1 and CPALS respectively. The median CPALS FMS 
was about 0.7, so the example shown is somewhat typical. The factorization is not terrible qualitatively, 
but the errors in the Factor 2 estimates do fail to capture details that CPAL1 solution does. 

4 Conclusion 

We derived a robust tensor factorization algorithm based on an approximate 1-norm loss. In compar- 
isons with methods using an LP solver we found that our method performed slightly faster on tensors 
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(b) A comparison of a single recovered factor column for a replicate when 77 = 0.2 and 7 = 2. Here 
the FMS was 0.91 and 0.64 for CPAL1 and CPALS respectively. Factor columns were normalized for 
comparison. 



Figure 1: Panel la shows on average that CPALS factorizations are sensitive to artifact noise. Panel lb 
provides a close up comparison of the differences between the two methods for single recovered column 
in an instance where CPALS is less accurate. 



of similar size to those factored in the simulation experiments of this paper (not shown). We suspect 
the performance gap may widen depending on the size of the tensor. Indeed, to factor an arbitrary ten- 
sor of size Ii X • • • X Jjv the LP update for the i th factor matrix would be an optimization problem 
over 2 Ylnjti ^ n + ^ parameters. In contrast, the i th factor matrix update consists of Ii independent l\ 
minimizations over R parameters. Moreover, the independence of these minimizations present speed-up 
opportunities through parallelization. 

Our simulations demonstrated that there are non-Gaussian noise scenarios in which the quality of CPALS 
solutions suffer while those of CPAL1 tend to be insensitive to the presence of non-Gaussian noise. In 
simulation studies not shown we have seen that not all non-Gaussian perturbations cause noticeable 
degradation in the CPALS factorization. Conversely, there are situations when CPAL1 struggles as much 
as CPALS in the presence of artifact noise, e.g. when the data tensor is sparse as well. We conjecture 
that CPAL1 is most suited to handle artifact noise when the data tensor is dense. Finding an alternative 
to the 1-norm loss for sparse data with non-Gaussian noise is a direction for future research. 
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